Promoting data harmonization to evaluate vaccine hesitancy in LMICs: approach and applications

Background: Factors influencing the health of populations are subjects of interdisciplinary study. However, datasets relevant to public health often lack interdisciplinary breadth. It is difficult to combine data on health outcomes with datasets on potentially important contextual factors, like political violence or development, due to incompatible levels of geographic support; differing data formats and structures; differences in sampling procedures and wording; and the stability of temporal trends. We present a computational package to combine spatially misaligned datasets, and provide an illustrative analysis of multi-dimensional factors in health outcomes.

Methods: We rely on a new software toolkit, Sub-National Geospatial Data Archive (SUNGEO), to combine data across disciplinary domains and demonstrate a use case on vaccine hesitancy in Low and Middle-Income Countries (LMICs). We use data from the World Bank's High Frequency Phone Surveys (HFPS) from Kenya, Indonesia, and Malawi. We curate and combine these surveys with data on political violence, elections, economic development, and other contextual factors, using SUNGEO. We then develop a stochastic model to analyze the integrated data and evaluate 1) the stability of vaccination preferences in all three countries over time, and 2) the association between local contextual factors and vaccination preferences.

Results: In all three countries, vaccine acceptance is more persistent than vaccine hesitancy from round to round: the long-run probability of staying vaccine-acceptant (hesitant) was 0.96 (0.65) in Indonesia, 0.89 (0.21) in Kenya, and 0.76 (0.40) in Malawi. However, vaccine acceptance was significantly less durable in areas exposed to political violence, with percentage point differences (ppd) in vaccine acceptance of -10 (Indonesia), -5 (Kenya), and -64 (Malawi). In Indonesia and Kenya, although not Malawi, vaccine acceptance was also significantly less durable in locations without competitive elections (-19 and -6 ppd, respectively) and in locations with more limited transportation infrastructure (-11 and -8 ppd).

Conclusion: With SUNGEO, researchers can combine spatially misaligned and incompatible datasets. As an illustrative example, we find that vaccination hesitancy is correlated with political violence, electoral uncompetitiveness and limited access to public goods, consistent with past results that vaccination hesitancy is associated with government distrust.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-023-02088-z.

We obtain estimates for these covariates by (a) using SUNGEO's R package to geocode the locations of survey sampling units, and (b) using this geolocation to spatially match each household to its parent administrative unit. Figure B1.1 illustrates this geoprocessing strategy, with the example of Indonesia. Table B1.4 reports survey sample attrition statistics. The "Attrition" column shows the number (and percent) of households that dropped out of the sample between each pair of rounds. We use an apostrophe (') to denote rounds that we excluded from our main analysis because their survey questionnaires did not include questions about vaccine hesitancy. To assess whether respondents who dropped out of the sample between rounds systematically differ from those who stayed, the table also includes standardized differences in means for the covariates in Tables B1.1-B1.3, including both the average standardized difference across all covariates, and the minimum and maximum values.
The statistics in this table suggest that sample attrition is potentially most problematic for our Kenya data, where drop rates are between 12 and 27 percent. On the other end of the spectrum are our data for Indonesia, where drop rates range from 3 to 9 percent. The largest standardized difference in means is half a standard deviation, for Malawi rounds 8 and 9 (the share of female respondents rose from 21 percent to 41 percent). We do not, however, use data from Malawi's round 8 in our analyses. Most standardized differences are well below 0.25 standard deviations, indicating that respondents who dropped out of these samples were quite similar on observables to those who remained.

B2 Main Analyses
The current section reports the full set of estimation results from the models used to generate the stationary distributions in Figures 2-4 in the main text. Our main model specification takes the following form:

$$y_{it} = \operatorname{logit}^{-1}\left( x_i'\theta_0 + y_{i,t-1}\, x_i'\gamma + \alpha_{j[i]} + \tau_t \right) + \varepsilon_{it} \qquad (1)$$

where $y_{it}$ is equal to 1 if respondent $i$ expressed an intent to obtain the Covid-19 vaccine in round $t$, and 0 if $i$ did not express such an intent. $y_{i,t-1}$ is a first-order temporal lag of this indicator for the previous round. $x_i$ is a set of covariates, which include both respondent-level attributes (age, sex) and the social, economic and political characteristics of the respondent's local geographic environment (electoral competitiveness, exposure to political violence, road density, night light intensity, urbanization, ethnolinguistic fractionalization, geographic terrain). $\alpha_{j[i]}$ are fixed effects corresponding to the first-order administrative units (e.g. provinces) $j \in 1, \dots, J$ in which households $i \in 1, \dots, N$ are located. $\tau_t$ are fixed effects for survey rounds $t \in 1, \dots, T$. $\varepsilon_{it}$ is the error term, for which we report robust standard errors, clustered by province and survey round. $\theta_0$ is the set of regression coefficients for households that had previously expressed no intent to get vaccinated ($y_{i,t-1} = 0$), and $\theta_1 = \theta_0 + \gamma$ is the set of coefficients for households that did express an intent to get vaccinated ($y_{i,t-1} = 1$). We use a logit link function to relate the covariates $x$ to the corresponding transition probabilities $\Pr_{i,t}(\text{"no"} \to \text{"yes"})$ and $\Pr_{i,t}(\text{"yes"} \to \text{"yes"})$. We use the predicted probabilities from this model to construct transition probability matrices under various counterfactual scenarios, and obtain stationary distributions of these transition matrices through eigenvalue decomposition.
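To illustrate how predicted probabilities from a logit model of this kind map into a transition matrix, the following Python sketch applies the inverse logit to the linear predictor for each lagged state. It is an illustration under stated assumptions rather than the estimation code behind Table B2.5: the function names, the single-covariate profile and all coefficient values are hypothetical, and the fixed effects are folded into the intercept for brevity.

```python
import math


def inv_logit(z):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))


def transition_matrix(x, theta0, gamma):
    """Build a 2x2 right-stochastic matrix with states ordered (no, yes).

    x      : covariate values, with a leading 1.0 for the intercept
    theta0 : coefficients when the lagged response is "no"
    gamma  : shift in coefficients when the lagged response is "yes",
             so that theta1 = theta0 + gamma
    """
    eta_no = sum(xk * t for xk, t in zip(x, theta0))
    eta_yes = sum(xk * (t + g) for xk, t, g in zip(x, theta0, gamma))
    p_no_yes = inv_logit(eta_no)    # Pr("no" -> "yes")
    p_yes_yes = inv_logit(eta_yes)  # Pr("yes" -> "yes")
    return [[1.0 - p_no_yes, p_no_yes],
            [1.0 - p_yes_yes, p_yes_yes]]


# Hypothetical profile: intercept plus one standardized covariate.
x = [1.0, 0.5]
theta0 = [-0.8, 0.4]  # illustrative values only, not estimates
gamma = [3.5, -0.2]   # illustrative values only, not estimates
P = transition_matrix(x, theta0, gamma)
```

Each row sums to one by construction, so the result is right-stochastic; evaluating the function at different covariate profiles yields the matrices for different counterfactual scenarios.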
To make the sample more closely resemble a simple random draw of each country's population, we weight each household observation with sampling weights provided by WBG HFPS. These weights are "based on the inclusion probabilities of the cell phones and landlines through which [respondents] can be reached," along with first-time and attrition non-response weighting adjustments, and calibration with auxiliary information on regional population size, respondent sex, age group, and educational attainment (Flores Cruz, 2022). Table B2.5 reports the raw $\theta_0$ and $\gamma$ estimates for equation (1), and model fit diagnostics.

B2.2 Transition Probability Matrices
Because the numerical estimates in Table B2.5 can be difficult to interpret on their own, Tables B2.6-B2.8 report transition probability matrices derived from the predicted probabilities of the models in Table B2.5. Tables B2.9-B2.11 show the stationary distributions of these transition matrices, based on an eigenvalue decomposition.
            no      yes
    no      0.65    0.35
    yes     0.04    0.96

Table B2.6: 2×2 right-stochastic matrix of vaccine intent (median household in Indonesia). Values represent predicted probabilities that a household originally in state i (row) transitioned to state j (column) across rounds. States include: 'no' (don't intend to take vaccine), 'yes' (do intend).

            no      yes
    no      0.21    0.79
    yes     0.11    0.89

Table B2.7: 2×2 right-stochastic matrix of vaccine intent (median household in Kenya). Values represent predicted probabilities that a household originally in state i (row) transitioned to state j (column) across rounds. States include: 'no' (don't intend to take vaccine), 'yes' (do intend).

Table B2.5: Regression coefficients. Outcome is expressed willingness to take vaccine. Fixed effect GLM (logit) coefficient estimates, clustered robust standard errors in parentheses.
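The stationary distribution of a two-state chain can be checked by hand. The Python sketch below takes the rounded entries of Tables B2.6 and B2.7 and solves πP = π in closed form, which for a 2×2 matrix is equivalent to extracting the left eigenvector with eigenvalue 1; a power iteration is included as a cross-check. The function names are illustrative, and because the inputs are rounded to two decimals the outputs are approximate.

```python
def stationary_2x2(P):
    """Stationary distribution of a 2-state right-stochastic matrix.

    Solves pi P = pi with pi summing to 1. For two states, this left
    eigenvector (eigenvalue 1) has the closed form used below.
    """
    a = P[0][1]  # Pr(no -> yes)
    b = P[1][0]  # Pr(yes -> no)
    return [b / (a + b), a / (a + b)]


def power_iterate(P, start, steps=500):
    """Cross-check: repeatedly apply pi <- pi P from a starting distribution."""
    pi = list(start)
    for _ in range(steps):
        pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
              pi[0] * P[0][1] + pi[1] * P[1][1]]
    return pi


indonesia = [[0.65, 0.35], [0.04, 0.96]]  # rounded entries, Table B2.6
kenya = [[0.21, 0.79], [0.11, 0.89]]      # rounded entries, Table B2.7

for name, P in [("Indonesia", indonesia), ("Kenya", kenya)]:
    no, yes = stationary_2x2(P)
    print(f"{name}: no={no:.3f}, yes={yes:.3f}")
```

With these rounded inputs, the long-run probability of "yes" comes out at about 0.90 for Indonesia and 0.88 for Kenya, regardless of the starting distribution.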

B2.3 Stationary Distributions for Other Covariates
Figures B2.2-B2.4 report additional simulated stationary distributions, for all covariates besides those already discussed in the main text (competitiveness, violence and road density). Each sub-plot compares the stationary distribution under two counterfactual scenarios, where the focal covariate takes a particular value, while all other covariates are held constant at their median values. For continuous covariates, the labels "high" and "low" correspond to the 99th and 1st percentiles (see Tables B1.1-B1.3). For example, Figure B2.2a suggests that the long-run probability of saying "yes" to the vaccine is higher for younger survey respondents: it is 0.96, on average, for respondents in the 99th age percentile of our Indonesian sample (70 years old), and 0.75 for those in the 1st percentile (18 or younger). Figures B2.2-B2.4 highlight several patterns that are consistent across the three countries, as well as several points of cross-national variation. In all three countries, older respondents are more hesitant to take the vaccine. Respondents who lived in more ethnically fractionalized areas were also consistently more hesitant to get the vaccine than respondents in less fractionalized areas.
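The construction of the "low" and "high" scenarios can be sketched as follows. This Python example builds two covariate profiles that differ only in the focal covariate, set to its 1st or 99th percentile, with every other covariate fixed at its median. The function names, the toy data and the nearest-rank percentile rule are assumptions made for illustration; other percentile conventions would give slightly different cut points.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100); one of several conventions."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def counterfactual_profiles(rows, focal, low_q=1, high_q=99):
    """Return (low, high) covariate profiles for one focal covariate.

    rows  : list of dicts mapping covariate name -> numeric value
    focal : covariate varied between its low and high percentile;
            every other covariate is held at its sample median
    """
    names = list(rows[0].keys())
    base = {n: statistics.median(r[n] for r in rows) for n in names}
    column = [r[focal] for r in rows]
    low = dict(base, **{focal: percentile(column, low_q)})
    high = dict(base, **{focal: percentile(column, high_q)})
    return low, high


# Toy data with hypothetical values, for illustration only.
rows = [
    {"age": 18, "road_density": 0.2},
    {"age": 25, "road_density": 0.5},
    {"age": 40, "road_density": 0.9},
    {"age": 55, "road_density": 0.4},
    {"age": 70, "road_density": 0.7},
]
low, high = counterfactual_profiles(rows, "age")
```

Each profile would then be passed through the fitted model to obtain a transition matrix and its stationary distribution, so that the two scenarios can be compared side by side.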
Other covariates vary in importance across countries. For example, while Kenyan and Malawian respondents expressed more willingness to obtain the vaccine in more economically active locations (as proxied by luminosity), this pattern did not hold in Indonesia. While urbanization appears to be a strong predictor of vaccine hesitancy in Indonesia and Malawi, we observe no such relationship in Kenya.
Finally, there are several covariates whose relationships to vaccine hesitancy vary across countries not only in statistical significance, but also in direction. While male respondents are more willing to take the vaccine in Indonesia and Malawi, the opposite is true in Kenya. While rugged terrain is correlated with greater vaccine acceptance in Indonesia and Kenya, the opposite is true in Malawi.

B3.1 Alternative Sources and Measures
One of the advantages of SUNGEO as a platform for empirical research is the availability of multiple data sources and measures for the same theoretical constructs. For example, our main analysis employs specific measures of political violence and electoral competitiveness. Yet alternative data sources and measures for these variables do exist, each of which may represent a slightly different quantity of interest, and which may be subject to idiosyncratic forms of measurement error and bias. To gauge how sensitive our analyses are to these choices, we re-estimate the model in equation (1) with alternative data sources and measures, and report the resulting simulated stationary distributions below. In the case of political violence, these analyses suggest that the direction of the relationship to vaccine hesitancy is mostly consistent across data sources (less violence → more willingness to take the vaccine), with several exceptions: UCDP-GED for Indonesia, SCAD for Kenya, and ACLED for Malawi. AIC statistics indicate that NVMS offers the best model fit in Indonesia, UCDP-GED offers the best fit for Kenya, and SCAD offers the best fit for Malawi. Consequently, these are the sources we employ in our main analyses.
In the case of electoral competitiveness, the two measures yield numerically similar estimates in all cases. AIC statistics indicate that in 2 of 3 cases Top-1 Competitiveness provides a superior model fit. For this reason, our main analyses employ the Top-1 measure.
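The AIC comparison used to choose among sources and measures can be sketched in a few lines of Python. The example relies on the standard definition AIC = 2k − 2 ln L, where k is the number of estimated parameters and L the maximized likelihood. The labels and log-likelihood values below are hypothetical placeholders, not results from our models.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(fits):
    """fits maps a label to (log_likelihood, n_params); return (label, aic)."""
    scores = {label: aic(ll, k) for label, (ll, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical fits for models that differ only in the data source used
# for one covariate; the values are placeholders for illustration.
fits = {
    "source_A": (-1520.4, 12),
    "source_B": (-1498.7, 12),
    "source_C": (-1510.2, 12),
}
label, score = best_by_aic(fits)
```

When the candidate models have the same number of parameters, as in this toy case, ranking by AIC is equivalent to ranking by log-likelihood; the penalty term only matters when specifications differ in size.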

B3.2 Adjustments for Autocorrelated Errors
Our main analysis includes two-way clustered standard errors, which account for non-independence of survey observations in the same administrative unit and survey round. While conservative, this approach does not directly account for other types of potential spatial and temporal dependencies. First, within and across administrative areas, households in nearby locations may be more similar in their survey responses than households in more distant locations. Second, responses taken in consecutive survey rounds may be more similar to each other than responses from more distant, non-consecutive survey rounds. Both types of autocorrelation may lead to biased or inefficient estimates of relationships between local contextual factors and vaccine hesitancy. The current section reports estimates that directly adjust for spatially and temporally autocorrelated residuals.

B3.2.1 Spatial Autocorrelation
Figures B3.11-B3.13 report simulated stationary distributions with Conley (1999) standard error estimates, which correct for spatial correlation among respondent locations that fall within a set distance of each other. By way of a cutoff, we used the median distance from each primary sampling unit to its 5 nearest neighbors (80 km for Indonesia, 104 km for Kenya, 26 km for Malawi). This cutoff ensures that no sampling location is treated as a geographic isolate, and that each location is grouped with up to five others in its immediate geographic neighborhood. While estimates for Indonesia lose significance, the remaining results appear robust to this adjustment.
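One way to compute a distance cutoff of this kind is sketched below in Python. The example measures great-circle (haversine) distances between sampling locations, collects the distances from each location to its k nearest neighbors, and takes the median of the pooled values. This is only one possible reading of the rule described above; the function names, the Earth radius of 6371 km and the toy coordinates are assumptions for illustration.

```python
import math
import statistics


def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def knn_cutoff_km(points, k=5):
    """Median of the distances from each point to its k nearest neighbors."""
    pooled = []
    for i, p in enumerate(points):
        dists = sorted(haversine_km(p, q)
                       for j, q in enumerate(points) if j != i)
        pooled.extend(dists[:k])  # k closest neighbors of point i
    return statistics.median(pooled)


# Toy sampling locations (hypothetical coordinates, in degrees).
toy_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0),
              (0.5, 0.5), (1.0, 0.0), (1.0, 1.0)]
cutoff = knn_cutoff_km(toy_points, k=5)
```

The pairwise loop is quadratic in the number of locations, which is adequate for a few thousand sampling units; a spatial index would be preferable for larger samples.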

B3.2.2 Temporal Autocorrelation
To account for potential temporal dependence across rounds, Figures B3.14-B3.16 consider additional models with a nonparametric Driscoll-Kraay time-series covariance matrix estimator, which is robust to general forms of cross-sectional and temporal dependence (in our case, up to one time lag). As with the Conley standard errors, this design yields wider confidence intervals than our main specification, particularly for Indonesia, indicating that autocorrelation may drive at least some of our results.

B3.3 Matched Analysis
One of several potential barriers to inference in our analysis is the possibility that our survey sample may be imbalanced on key contextual factors. Our main analyses use survey sampling weights (i.e. inverse probabilities of selection) to help approximate a random draw of the population. However, this re-weighting strategy does not address selection on dimensions beyond participant demographics. This problem may be particularly acute with respect to political violence. For example, survey sampling weights may account for the possibility that households in areas exposed to high levels of violence may be more difficult and costly for survey teams to reach, including by telephone. Yet even after accounting for their inclusion probability, these respondents may still differ from others in the sample in ways beyond their exposure to violence (e.g. differences in access to economic opportunities, access to infrastructure, differences in the local ethnolinguistic environment, type of geographic terrain). Some of these differences may also be relevant to vaccination, confounding our ability to assess the relationship between violence and vaccine hesitancy.
To alleviate some of these concerns, we used statistical matching to create re-weighted survey samples in which respondents with high exposure to violence were as similar as possible to respondents exposed to a lower level of violence. While matching is an adjustment-based solution that cannot facilitate causal inference without making quite onerous identifying assumptions (like the absence of unmeasured and unobserved confounders), this approach can reduce model dependence by down-weighting outliers and other influential observations, and preventing extrapolation outside the range of available data.
For each country, we applied three types of matching solutions: (1) propensity score matching, which pairs observations with a similar predicted probability of being selected into treatment (Rosenbaum and Rubin, 1983), (2) Mahalanobis distance matching, which seeks to minimize a scale-invariant distance between pre-treatment covariates, while taking into account correlation between the covariates (Sekhon, 2011), and (3) genetic matching, an extension of multivariate matching that uses an evolutionary search algorithm to determine the weight for each covariate (Sekhon and Diamond, 2013). We dichotomize the political violence variable for this purpose by labeling administrative units exposed to above-average levels of violence as "high violence" and those below the mean as "low violence." The matching covariates are the same ones as in the matrix x in equation (1), excluding violence.
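As a rough illustration of the second of these solutions, the Python sketch below dichotomizes a violence measure at its mean and then pairs each "high violence" unit with the closest "low violence" unit in Mahalanobis distance. It is a simplified stand-in, not the software used for our estimates: it handles only two covariates (so the covariance matrix can be inverted by hand), matches one-to-one with replacement, and uses hypothetical function names and toy values.

```python
import statistics


def dichotomize_at_mean(values):
    """Label values above the sample mean as treated (1), else control (0)."""
    mean = statistics.fmean(values)
    return [1 if v > mean else 0 for v in values]


def inv_cov_2d(X):
    """Inverse of the 2x2 sample covariance matrix of two-column data."""
    n = len(X)
    m0 = sum(r[0] for r in X) / n
    m1 = sum(r[1] for r in X) / n
    s00 = sum((r[0] - m0) ** 2 for r in X) / (n - 1)
    s11 = sum((r[1] - m1) ** 2 for r in X) / (n - 1)
    s01 = sum((r[0] - m0) * (r[1] - m1) for r in X) / (n - 1)
    det = s00 * s11 - s01 ** 2
    return [[s11 / det, -s01 / det], [-s01 / det, s00 / det]]


def mahalanobis(u, v, S_inv):
    """Mahalanobis distance between two 2-dimensional points."""
    d0, d1 = u[0] - v[0], u[1] - v[1]
    q = (d0 * (S_inv[0][0] * d0 + S_inv[0][1] * d1)
         + d1 * (S_inv[1][0] * d0 + S_inv[1][1] * d1))
    return q ** 0.5


def match_nearest(X, treated):
    """Map each treated row index to its closest control row index."""
    S_inv = inv_cov_2d(X)
    controls = [i for i, t in enumerate(treated) if t == 0]
    return {i: min(controls, key=lambda c: mahalanobis(X[i], X[c], S_inv))
            for i, t in enumerate(treated) if t == 1}


# Toy data (hypothetical): violence counts and two covariates per unit.
violence = [0, 1, 2, 9, 12, 8]
X = [(1.0, 10.0), (2.0, 30.0), (3.0, 10.0),
     (1.2, 11.0), (2.9, 12.0), (2.0, 29.0)]
treated = dichotomize_at_mean(violence)
pairs = match_nearest(X, treated)
```

Because the distance is scaled by the inverse covariance matrix, covariates measured on very different scales contribute comparably to the match, which is the scale-invariance property noted above.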
Table B3.12 summarizes covariate balance statistics before and after matching, for all three countries and all matching solutions. The metric we report here is standardized difference, or the absolute difference in means between "treated" (high violence) and "control" (low violence) units, divided by the standard deviation of the "treated" group. While there are no universally accepted criteria for assessing improvements in balance, standardized bias of 0.25 and lower is a common standard in social science (Ho et al., 2007). The table reports averages of standardized bias, across all covariates.
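The balance metric defined above is straightforward to compute. The following Python sketch implements the standardized difference for one covariate and its average across covariates. The function names and toy values are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the definition above does not specify the denominator.

```python
import statistics


def standardized_difference(treated, control):
    """|mean(treated) - mean(control)| divided by sd(treated)."""
    gap = abs(statistics.fmean(treated) - statistics.fmean(control))
    return gap / statistics.stdev(treated)  # sample sd (assumption)


def average_standardized_bias(treated_cols, control_cols):
    """Average the statistic across covariates.

    Both arguments map a covariate name to a list of values.
    """
    diffs = [standardized_difference(treated_cols[name], control_cols[name])
             for name in treated_cols]
    return statistics.fmean(diffs)


# Toy values (hypothetical), for illustration only.
treated_cols = {"age": [2.0, 4.0, 6.0], "roads": [1.0, 2.0, 3.0]}
control_cols = {"age": [3.0, 3.0, 3.0], "roads": [2.0, 2.0, 2.0]}
avg_bias = average_standardized_bias(treated_cols, control_cols)
balanced = avg_bias <= 0.25  # conventional threshold cited in the text
```

Computing the statistic before and after matching, on the same covariates, shows whether a given matching solution has moved the sample toward or away from the 0.25 benchmark.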
For Indonesia, the matching solution with the greatest improvement in balance was Mahalanobis distance. For Kenya, propensity score matching outperformed the rest. In Malawi, no matching solution generated a major improvement in balance: standardized differences remained unacceptably high, with Mahalanobis distance in a narrow lead.
Figures B3.17-B3.19 report the simulated stationary distributions, re-estimated on each matched sample. While numerical estimates diverge significantly from those in the full sample, the direction of the relationship is, at least for Indonesia and Kenya, the same as before. The estimates for Malawi are a clear exception, although we urge caution in reading too deeply into them, given the significant remaining imbalance and small sample size reported in Table B3.12.

In our main analyses, we found that older respondents were more hesitant to take the vaccine, while male respondents were (generally) less vaccine-hesitant than female respondents (see Figures B2.2-B2.4). One possibility is that exposure to violence dampens these relationships, by reducing the salience of infectious disease as an immediate threat to life among directly affected groups (i.e. military-age males). An alternative possibility is that exposure to violence amplifies these relationships, creating starker differences in expressed opinion across age groups and sexes. To evaluate the relative merits of these two arguments, we adopt a slightly different model specification:

Figures B3.20 and B3.21 report several sets of estimates from these cross-level interaction models. The point values represent differences in predicted probabilities under two types of counterfactual scenarios (i.e. younger → older, female → male), broken down by level

    # Get list of available data for a single country:
    > info_2 <- get_info(country_names = "Afghanistan")
    > info_2["summary"]
    > info_2["topics"]
    > info_2["geosets"]

    # Get list of available data for a single topic:
    > info_3 <- get_info(topics = "Elections:LowerHouse:CLEA")
    > info_3["summary"]
    > info_3["topics"]

    # Get list of available data for multiple countries and topics:
    > info_4 <- get_info(
    +   country_names = c("Afghanistan", "Zambia"),
    +   topics = c("Elections:LowerHouse:CLEA",
    +              "Events:PoliticalViolence:GED"))
    > info_4["summary"]

Let's try downloading some data. For example, here is a query for a single country and single topic:

    # Population data for Afghanistan:
    > data_1 <- get_data(
    +   country_name = "Afghanistan",
    +   topics = "Demographics:Population:GHS")
    [1] "Fetching..."
    [1] "Combining..."
    Time difference of 8.059215 secs
    > str(data_1)
    Classes 'data.table' and 'data.frame': 952 obs. of 34 variables:
    ...

Figure B1.1: Integration of Survey Data with Contextual Data on Violence, Elections and Road Infrastructure (Indonesia)

Figure B2.2: Additional Counterfactual Stationary Distributions, Indonesia. (a) Respondent's Age

Figures B3.5-B3.7 report counterfactual stationary distributions with alternative data sources for political violence assembled by the Cross-National Data on Sub-National Violence (xSub) project, including the Armed Conflict Location and Event Data Project (ACLED), the National Violence Monitoring System (NVMS), the Social Conflict Analysis Database (SCAD), and the UCDP Georeferenced Event Dataset (UCDP-GED). Figures B3.8-B3.10 report stationary distributions for electoral competitiveness, using two alternative measures for lower-house parliamentary elections from the Constituency-Level Elections Archive: the Top Party Competitiveness Score (Top-1), and the Top-Two Party Competitiveness Score (Top-2). The figures also report the Akaike Information Criterion (AIC) for the model used for each simulation result, where lower values indicate a better trade-off between model fit and complexity.

Figure B3.17: Estimates with Matched Samples, Political Violence, Indonesia. (a) Genetic Matching

Table B2.9: Stationary distribution of 2×2 right-stochastic matrix (median household in Indonesia). Values represent long-term probabilities that a household ends up in each state, irrespective of initial distribution. States include: 'no' (don't intend to take vaccine), 'yes' (do intend to get vaccine).

Table B2.10: Stationary distribution of 2×2 right-stochastic matrix (median household in Kenya). Values represent long-term probabilities that a household ends up in each state, irrespective of initial distribution. States include: 'no' (don't intend to take vaccine), 'yes' (do intend to get vaccine).
B3.4 Cross-Level Interactions
One possibility that our main model specification does not directly consider is that individual-level characteristics, such as age or sex, might interact with local contextual factors in ways that are relevant to vaccine hesitancy. For example, political violence can have profound demographic, social and economic effects on communities exposed to it. Armed conflict can shape the age distribution of exposed populations, by lowering life expectancy and labor force participation rates among military-age males. It can also affect gender relations, in the direction of both empowerment and subjugation: war can increase women's participation in the labor force, albeit temporarily (e.g. World War II), yet it can also place new restrictions on women's social roles and economic opportunities (e.g. Syrian Civil War).